{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> [Kaggle入门教程](http://www.cnblogs.com/cutd/p/5713520.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 查看数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 属性解释\n",
    "- PassengerId：一个用以标记每个乘客的数字id\n",
    "- Survived：标记乘客是否幸存——幸存(1)、死亡(0)。我们将预测这一列。\n",
    "- Pclass：标记乘客所属船层——第一层(1),第二层(2),第三层(3)。\n",
    "- Name：乘客名字。\n",
    "- Sex：乘客性别——男male、女female\n",
    "- Age：乘客年龄。部分。\n",
    "- SibSp：船上兄弟姐妹和配偶的数量。\n",
    "- Parch：船上父母和孩子的数量。\n",
    "- Ticket：乘客的船票号码。\n",
    "- Fare：乘客为船票付了多少钱。\n",
    "- Cabin：乘客住在哪个船舱。\n",
    "- Embarked：乘客从哪个地方登上泰坦尼克号。\n",
    "\n",
    "我们都知道妇女和儿童更可能被救。因此，年龄和性别很可能更好的帮助我们预测。认为乘客的船层可能会影响结果也是符合逻辑的，因为第一层的船舱更靠近船的甲板。Fare票价和乘客所住船层相关，而且可能是高度相关的，但是也可能会增加一些额外的信息。SibSp、Parch兄弟姐妹、配偶、父母/孩子的数量很可能关系到是否被某一个或很多个人救，会有很多人去帮助你或者有很多人想到你尝试去救你。\n",
    "像Embarked登船(也许有一些信息和怎么靠近船的顶部的人的船舱有关),Ticket票号和Name名字。\n",
    "这一步通常是习得相关的领域知识[要对业务比较深入]，这对于绝大多数机器学习任务来说非常非常非常的重要。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "# We can use the pandas library in python to read in the csv file.\n",
    "# This creates a pandas dataframe and assigns it to the titanic variable.\n",
    "titanic = pd.read_csv(\"Data/train.csv\")\n",
    "\n",
    "print(titanic.describe())  # 只显示数值列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(titanic.head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据清洗\n",
    "- 使用中位数填充年龄缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 非数值列\n",
    "- Name, Sex, Cabin, Embarked, Ticket\n",
    "- 忽略Ticket, Cabin,Name，因为Cabin缺失太严重，而且起初看起来好像也不是特别有用的列。如果没有一些关于船票号码意义的专业领域知识和哪些名字的特点像大户或者有钱家庭，那么Ticket和Name列也不太可能会告诉我们太多的信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(titanic.isnull().sum(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 性别处理\n",
    "- 将0赋值给male，将1赋值给female"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic.loc[titanic['Sex']=='male','Sex'] = 0\n",
    "titanic.loc[titanic['Sex']=='female','Sex'] = 1\n",
    "print(titanic.head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 登船Embarked列\n",
    "- Embarked列的唯一值是S，C，Q和缺失值nan。每个字母是一个登船港口名字的缩写\n",
    "- 将0赋值给S，将1赋值给C，将2赋值给Q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(titanic['Embarked'].unique())\n",
    "titanic['Embarked'] = titanic['Embarked'].fillna('S')\n",
    "titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2\n",
    "print(titanic.head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 进行预测操作\n",
    "- 使用scikit-learn库来做预测。用skelearn的一个助手来将数据分成交叉验证的子类，然后用每一个子类分别来训练算法做预测。最后，我们将得到一个预测列表，每一个列表项包含了相关子类的预测数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import the linear regression class\n",
    "from sklearn.linear_model import LinearRegression\n",
    "# Sklearn also has a helper that makes it easy to do cross validation\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# The columns we'll use to predict the target\n",
    "predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
    "\n",
    "# Initialize our algorithm class\n",
    "alg = LinearRegression()\n",
    "# Generate cross validation folds for the titanic dataset.  It return the row indices corresponding to train and test.\n",
    "# We set random_state to ensure we get the same splits every time we run this.\n",
    "kf = KFold(n_splits=3, random_state=1)\n",
    "\n",
    "predictions = []\n",
    "for train, test in kf.split(titanic['Age']):\n",
    "    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.\n",
    "    train_predictors = (titanic[predictors].iloc[train,:])\n",
    "    # The target we're using to train the algorithm.\n",
    "    train_target = titanic[\"Survived\"].iloc[train]\n",
    "    # Training the algorithm using the predictors and target.\n",
    "    alg.fit(train_predictors, train_target)\n",
    "    # We can now make predictions on the test fold\n",
    "    test_predictions = alg.predict(titanic[predictors].iloc[test,:])\n",
    "    predictions.append(test_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 处理测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titanic_test = pandas.read_csv(\"Data/test.csv\")\n",
    "titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())\n",
    "titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())\n",
    "titanic_test.loc[titanic_test['Sex'] == 'male','Sex'] = 0\n",
    "titanic_test.loc[titanic_test['Sex'] == 'female','Sex'] = 1\n",
    "titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'Q', 'Embarked'] = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 生成一个提交文件\n",
    "- 首先，我们必须先在训练数据上训练一个算法。然后我们在测试数据集上做一个预测。最后，我们生成一个包含预测和乘客id的csv文件。\n",
    "- 一旦你完成了下面步骤的代码，你就能用submission.to_csv('submission.csv', index=False)输出一个csv结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Initialize the algorithm class\n",
    "alg = LogisticRegression(random_state=1)\n",
    "\n",
    "# Train the algorithm using all the training data\n",
    "alg.fit(titanic[predictors], titanic[\"Survived\"])\n",
    "\n",
    "# Make predictions using the test set.\n",
    "predictions = alg.predict(titanic_test[predictors])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "predictions = predictions.astype(int)\n",
    "# Create a new dataframe with only the columns Kaggle wants from the dataset.\n",
    "submission = pandas.DataFrame({\n",
    "        \"PassengerId\": titanic_test[\"PassengerId\"],\n",
    "        \"Survived\": predictions\n",
    "    })\n",
    "submission.to_csv('Data/submission.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 完整代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "# coding=utf-8\n",
    "# Filename : first.py\n",
    "# Created by iFantastic on 2017/8/14\n",
    "# Description : http://www.cnblogs.com/cutd/p/5713520.html\n",
    "\n",
    "import pandas as pd\n",
    "# Import the linear regression class\n",
    "from sklearn.linear_model import LinearRegression\n",
    "# Sklearn also has a helper that makes it easy to do cross validation\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# 读取训练数据\n",
    "titanic = pd.read_csv('Data/train.csv')\n",
    "\n",
    "# 处理Age中的缺失值\n",
    "titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())\n",
    "# 量化Sex\n",
    "titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0\n",
    "titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1\n",
    "# 量化Embarked\n",
    "titanic['Embarked'] = titanic['Embarked'].fillna('S')\n",
    "titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2\n",
    "\n",
    "# 进行预测操作\n",
    "# The columns we'll use to predict the target\n",
    "predictors = [\"Pclass\", \"Sex\", \"Age\", \"SibSp\", \"Parch\", \"Fare\", \"Embarked\"]\n",
    "\n",
    "# Initialize our algorithm class\n",
    "alg = LinearRegression()\n",
    "# Generate cross validation folds for the titanic dataset.  It return the row indices corresponding to train and test.\n",
    "# We set random_state to ensure we get the same splits every time we run this.\n",
    "folder = KFold(n_splits=3, random_state=1)\n",
    "\n",
    "predictions = []\n",
    "for train, test in folder.split(titanic[predictions]):\n",
    "    # The predictors we're using the train the algorithm.  Note how we only take the rows in the train folds.\n",
    "    train_predictors = (titanic[predictors].iloc[train, :])\n",
    "    # The target we're using to train the algorithm.\n",
    "    train_target = titanic[\"Survived\"].iloc[train]\n",
    "    # Training the algorithm using the predictors and target.\n",
    "    alg.fit(train_predictors, train_target)\n",
    "    # We can now make predictions on the test fold\n",
    "    test_predictions = alg.predict(titanic[predictors].iloc[test, :])\n",
    "    predictions.append(test_predictions)\n",
    "\n",
    "# 处理测试集\n",
    "titanic_test = pd.read_csv(\"Data/test.csv\")\n",
    "titanic_test['Age'] = titanic_test['Age'].fillna(titanic_test['Age'].median())\n",
    "titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())\n",
    "titanic_test.loc[titanic_test['Sex'] == 'male', 'Sex'] = 0\n",
    "titanic_test.loc[titanic_test['Sex'] == 'female', 'Sex'] = 1\n",
    "titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'S', 'Embarked'] = 0\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'C', 'Embarked'] = 1\n",
    "titanic_test.loc[titanic_test['Embarked'] == 'Q', 'Embarked'] = 2\n",
    "\n",
    "# 用测试集进行预测\n",
    "# Initialize the algorithm class\n",
    "alg = LogisticRegression(random_state=1)\n",
    "\n",
    "# Train the algorithm using all the training data\n",
    "alg.fit(titanic[predictors], titanic[\"Survived\"])\n",
    "\n",
    "# Make predictions using the test set.\n",
    "predictions = alg.predict(titanic_test[predictors])\n",
    "\n",
    "# 保存结果\n",
    "# Create a new dataframe with only the columns Kaggle wants from the dataset.\n",
    "predictions = predictions.astype(int)\n",
    "submission = pd.DataFrame({\n",
    "        \"PassengerId\": titanic_test[\"PassengerId\"],\n",
    "        \"Survived\": predictions\n",
    "    })\n",
    "submission.to_csv('Data/submission.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
